Exploratory Analysis of Wines

Posted on Mar 15 janvier 2019 in Data Analysis

Wine Review Analysis: Worldwide & France

Data:

Source of Data Dimension: 10 columns and 150k rows of wine reviews.

Context:

The data was scraped from WineEnthusiast on November 22nd, 2017 by zackthoutt. WineEnthusiast collect wine reviews from a team of experienced wine taster, the dataset represent a set of wine reviews with ratings and description of the wine tasted.

Goal:

Exploratory Analysis of wines:

  • Distributions.
  • Influence of wine ratings over prices.
  • Most expensive variety.
  • Best variety rating.

Country: The country that the wine is from description

designation: The vineyard within the winery where the grapes that made the wine are from

points: The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)

price: The cost for a bottle of the wine

province: The province or state that the wine is from

region_1: The wine growing area in a province or state (ie Napa)

region_2: Sometimes there are more specific regions specified within a wine growing area (ie Rutherford inside the Napa Valley), but this value can sometimes be blank

variety: The type of grapes used to make the wine (ie Pinot Noir)

winery: The winery that made the wine

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import missingno as msno

import matplotlib as mpl
%matplotlib inline

import warnings
warnings.filterwarnings(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

# Customize Matplotlib
mpl.rcParams['font.size'] = 14
mpl.rcParams["figure.figsize"] = (10,8)
In [2]:
df = pd.read_csv('../input/winemag-data_first150k.csv',index_col=0)
In [3]:
df.head()
Out[3]:
country description designation points price province region_1 region_2 variety winery
0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
2 US Mac Watson honors the memory of a wine once ma... Special Selected Late Harvest 96 90.0 California Knights Valley Sonoma Sauvignon Blanc Macauley
3 US This spent 20 months in 30% new French oak, an... Reserve 96 65.0 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi
4 France This is the top wine from La Bégude, named aft... La Brûlade 95 66.0 Provence Bandol NaN Provence red blend Domaine de la Bégude

1) Overall Exploration

Visualize nullity by column

In [4]:
msno.bar(df)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1184a1da90>

Country Distribution

In [23]:
sns.countplot(y="country", data=df, order=df.country.value_counts().iloc[:10].index)
plt.show()

Variety Distribution

In [24]:
sns.countplot(y="variety", data=df, order=df.variety.value_counts().iloc[:10].index)
plt.show()

TOP3 Variety by Country

In [7]:
groupby_variety = df.groupby(["country", "variety"])["variety"].count()

top_variety = groupby_variety.groupby(level=0, group_keys=False)
top3_variety = top_variety.nlargest(3)

df_top3_variety = top3_variety.loc[["US", "Italy", "France", "Spain","Chile"],:].to_frame()

df_top3_variety.columns = ["Frequency"]
df_top3_variety.index.names = ['Countries','Variety']
df_top3_variety.reset_index(inplace=True)

sns.catplot(x="Countries", y="Frequency", hue="Variety", data=df_top3_variety,
                height=7, kind="bar", palette="muted")
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x7f118479c5c0>

Price Distribution

In [8]:
percentage_NaN_price = round(df.price.isnull().sum() / df.shape[0], 2)
print("Percentage of NaN Price: {}".format(percentage_NaN_price))

df.price.fillna(0, inplace=True)

ax = sns.distplot(np.log10(df.price + 1))
plt.title("Price Log distribution")
plt.show()
Percentage of NaN Price: 0.09

The majority of wine prices are between 10 and 100$(logarithm base 10).

Points Distribution

In [9]:
df.points.describe()
Out[9]:
count    150930.000000
mean         87.888418
std           3.222392
min          80.000000
25%          86.000000
50%          88.000000
75%          90.000000
max         100.000000
Name: points, dtype: float64
In [10]:
ax = sns.distplot(df.points)
plt.show()

WineEnthusiast only post wines with a range >= 80. The following scale rating is used:

  • 95-100 Classic: a great wine
  • 90-94 Outstanding: a wine of superior character and style
  • 85-89 Very good: a wine with special qualities
  • 80-84 Good: a solid, well-made wine

Wine reviews are based on blind tastings. Different editors cover the same region year to year. This rating is not a famous one, like Robert Parker or Revue du Vin de France.

50% of wine reviews are between 86-90 points. As we could imagine there is very small number of great wines >= 95.

In [11]:
points_percentage = df.points.value_counts(normalize=True) * 100
great_wines_percentage = round(points_percentage.loc[95:100].sum(), 2)
print("Proportion of great Wines>= 95: {}".format(great_wines_percentage))
Proportion of great Wines>= 95: 3.57

Influence of Wine Ratings over Price

In [25]:
ax = sns.boxplot(x="points", y="price", data=df)
plt.title("Wine Reviews vs Price")
plt.show()

subset_ratings_step5 = df.loc[df.points.isin([80, 85, 90, 95, 100]), ["price","points"]]
ax = sns.boxplot(x="points", y="price", data=subset_ratings_step5)
plt.ylim([0,1000])
plt.show()

subset_ratings_step2 = df.loc[df.points.isin([80, 82, 84, 86, 88, 90]), ["price","points"]]
ax = sns.boxplot(x="points", y="price", data=subset_ratings_step2)
plt.ylim([0,200])
plt.show()

We don't see a clear influence of low ratings <= 88 over prices, whereas high ratings influence clearly price.

Let's do an ANOVA Test to find relationships between ratings and prices.

ANOVA will find if there is a real difference between price means and ratings by testing if there is a significant difference between ratings.

In [13]:
#ANOVA
model = smf.ols(formula='points ~ price', data=df)
results = model.fit()
print("The ANOVA Test find a significant difference between these 2 variables. \n"
      "F-statistic: {} F-pvalue: {}".format(results.fvalue, results.f_pvalue))
The ANOVA Test find a significant difference between these 2 variables. 
F-statistic: 28167.758351969973 F-pvalue: 0.0

Most expensive variety

In [14]:
median_top20_variety = df.loc[:, ["variety", "price"]] \
         .groupby("variety") \
         .median() \
         .sort_values(by="price", ascending= False)  \
         .index[:20] \
         .values
plt.figure(figsize=(10,8))
ax = sns.boxplot(x="price", y="variety",
                 data=df.loc[df["variety"].isin(median_top20_variety),:],
                 order=median_top20_variety
                )
ax = sns.swarmplot(x="price", y="variety", 
                   data=df.loc[df["variety"].isin(median_top20_variety),:],
                   color=".25")
plt.title("Price Distribution by Variety")
plt.xlim([0,800])
plt.show()

I notice that there is some mistakes in the fied Variety. Some wine reviewers has labeled variety with:

  • A mix of grap variety like Cabernet Shiraz
  • Name of vineyards. A variety should be a variety of graps and that's it.

Very few wine reviews for some variety, high dispersion. we can't really say with precision which variety is the most expensive

Wineries with the biggest number of unique wines TOP20

In [15]:
top20_wineries = df.winery.value_counts(normalize=True).nlargest(20) \
.sort_values(ascending= False).index

plt.figure(figsize=(12,8))
sns.countplot(y="winery", data=df[df.winery.isin(top20_wineries)],
              order=top20_wineries, hue="country")
plt.show()

US and France wineries get a lot of different wines. Cameron Hughes winery is the only with wine produced all over the world.

2) Zoom on the vineyard of France

In [16]:
french_df = df.loc[df["country"] == "France",:]
french_df.shape
Out[16]:
(21098, 10)

Which variety is the most expensive ?

In [17]:
median_price_variety_fr = french_df.loc[:, ["variety", "price"]] \
         .groupby("variety") \
         .median() \
         .sort_values(by="price", ascending= False)      
median_price_top10_variety_fr = median_price_variety_fr.index[:10]

plt.figure(figsize=(10,8))
ax = sns.boxplot(x="price", y="variety",
                 data=french_df.loc[french_df["variety"].isin(median_price_top10_variety_fr),:],
                 order=median_price_top10_variety_fr
                )
plt.title("Price Distribution by Variety")
plt.xlim([0,800])
plt.show()
  • Carignan-Syrah and Muscadelle has too few observations to say it's really an expensive variety.
  • So Marsanne is the most expensive variety considering it's Median and it's InterQuartile Range. It's a white wine variety, not a famous variety, that's surprising !
  • Champagne worldwide famous, associated to the French luxury, is not the most expensive as I thought. There is always an opportunity to drink champagne, as Lily Bollinger said - “I only drink champagne when I’m happy, and when I’m sad. Sometimes I drink it when I’m alone. When I have company, I consider it obligatory. I trifle with it if I am not hungry and drink it when I am. Otherwise I never touch it—unless I’m thirsty.”

Which variety has the best ratings for the most frequent variety ?

In [18]:
top10_variety_fr = french_df.variety.value_counts(normalize=True).nlargest(10) \
.sort_values(ascending= False).index
median_ratings_top10_variety_fr = french_df.loc[french_df["variety"]
         .isin(top10_variety_fr), ["variety", "points"]] \
         .groupby("variety") \
         .median() \
         .sort_values(by="points", ascending= False) \
         .index

plt.figure(figsize=(10,8))
ax = sns.boxplot(x="points", y="variety",
                 data=french_df.loc[french_df["variety"].isin(top10_variety_fr),:],
                 order=median_ratings_top10_variety_fr
                )
plt.title("Ratings by Variety")
plt.show()

Champagne, White and Read Bordeaux get the higghest ratings: 90.

Champagne ratings tends to be higher than others variety, the whisker is more shifted to the right.

Wineries with the biggest number of unique wines

In [19]:
top20_wineries_fr = french_df.winery.value_counts(normalize=True).nlargest(20) \
.sort_values(ascending= False).index

plt.figure(figsize=(12,8))
sns.countplot(y="winery", data=french_df[french_df.winery.isin(top20_wineries_fr)],
              order=top20_wineries_fr, hue="province")
plt.show()

Burgundy (Bourgogne in French) is over represented, it came from the particularity of "climat" from this region.

A climat is a wine terroir, a small parcel of land with it's own particularities: climatic and geologic. There is more than 1000 different climat in Burgundy.